FIGURE 5.3 (a) Feed-forward Networks. (b) Scaled Dot-Product Attention. (c) Multi-Head Self-Attention.

second or higher-dimension tensors. For all other operations, such as sums, the computational cost added by the quantization operation outweighs the benefit of operating with reduced precision, so those operations are left unquantized. More precisely, all weights of the Transformer are quantized except the biases: because biases are summed with the INT32 output of matrix multiplications, quantizing them provides no additional computational efficiency. Furthermore, the memory footprint of the biases is insignificant compared to that of the weight matrices, as biases account for less than 0.1% of the total weights.
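As a rough illustration of this choice (not the authors' exact scheme), the sketch below applies uniform 8-bit quantization to a linear layer's weight matrix while leaving its bias untouched; the helper names and the min/max calibration are assumptions made for this example.

import numpy as np

def quantize_uint8(x, x_min=None, x_max=None):
    """Uniform asymmetric 8-bit quantization of a tensor.

    Returns the integer tensor plus the scale and zero-point needed to
    dequantize it. The clipping range defaults to the tensor's min/max,
    a simple calibration assumed for this sketch.
    """
    x_min = float(np.min(x)) if x_min is None else x_min
    x_max = float(np.max(x)) if x_max is None else x_max
    scale = (x_max - x_min) / 255.0
    zero_point = np.round(-x_min / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize_uint8(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

# A linear layer: the weight matrix is quantized, the bias is not,
# since it is added to the INT32 accumulator of the matrix product.
rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512)).astype(np.float32)
b = rng.normal(size=(512,)).astype(np.float32)

W_q, W_scale, W_zp = quantize_uint8(W)   # quantized weights
# b stays in full precision (or INT32 at inference time)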

As for the positional embeddings, since they are fixed, the authors quantize them once before training. The γ weights of the LayerNorm layers are also quantized.
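For instance, with the fixed sinusoidal encodings of the original Transformer (assumed here; the quantization step below is an illustrative stand-in, not the authors' code), the positional-embedding table can be built and quantized a single time before training starts.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed sinusoidal positional encodings from the original Transformer."""
    pos = np.arange(max_len)[:, None].astype(np.float32)
    i = np.arange(d_model)[None, :].astype(np.float32)
    angles = pos / np.power(10000.0, (2.0 * np.floor(i / 2.0)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

pe = sinusoidal_positional_encoding(max_len=1024, d_model=512)
# The table is fixed, so it is quantized once, before training;
# its values lie in [-1, 1], which maps directly onto 8-bit codes.
pe_q = np.clip(np.round((pe + 1.0) / (2.0 / 255.0)), 0, 255).astype(np.uint8)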

For activations, the authors quantize the sum of the input embeddings with the positional encodings in both the encoder and decoder. The (Q, K, V) matrices within the multi-head self-attention are quantized. Also quantized are the softmax's numerator, the softmax's denominator, the softmax's output, and the scaled dot-product attention's output, as shown in Fig. 5.3(b) and Fig. 5.3(c). At the inference stage, the full-precision exponential function used by the softmax is replaced with a low-bit version. For the position-wise feed-forward networks, they quantize the output of the ReLUs and the outputs of the feed-forward networks themselves, as shown in Fig. 5.3(a). Finally, for all LayerNorms, they quantize the numerator x − μ, the denominator √(σ² + ϵ), their quotient, and the output of the LayerNorm.
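To make these activation quantization points concrete, the sketch below fake-quantizes the tensors named above inside a single scaled dot-product attention head: Q, K, V, the softmax's numerator and denominator, the softmax's output, and the attention output. This is an illustrative reading of Fig. 5.3(b), not the authors' code; the fake_quant helper and its per-tensor min/max calibration are assumptions.

import numpy as np

def fake_quant(x, num_bits=8):
    """Simulated quantization: quantize to num_bits, then dequantize.

    The clipping range is taken from the tensor itself here; in practice
    the quantization ranges would be calibrated or learned.
    """
    x_min, x_max = float(np.min(x)), float(np.max(x))
    levels = 2 ** num_bits - 1
    scale = (x_max - x_min) / levels if x_max > x_min else 1.0
    q = np.clip(np.round((x - x_min) / scale), 0, levels)
    return q * scale + x_min

def quantized_scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    Q, K, V = fake_quant(Q), fake_quant(K), fake_quant(V)   # quantize Q, K, V
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax with its numerator, denominator and output quantized,
    # mirroring the quantization points of Fig. 5.3(b).
    numer = fake_quant(np.exp(scores - scores.max(axis=-1, keepdims=True)))
    denom = fake_quant(numer.sum(axis=-1, keepdims=True))
    attn = fake_quant(numer / denom)
    return fake_quant(attn @ V)   # quantized attention output

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 64)) for _ in range(3))
out = quantized_scaled_dot_product_attention(Q, K, V)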

5.2.3 Tensor Bucketing

The authors adopt tensor bucketing: instead of using a single set of quantization parameters per quantized tensor, they quantize subsets of the tensor, each with its own set of quantization parameters. Even though this adds more scalars, the overall memory cost is insignificant. Furthermore, the authors argue that the added flexibility can significantly alleviate the precision loss that results from mapping all of a tensor's values to a single low numerical precision domain. This tensor bucketing method uses several subsets equal to the output dimension